All right, I've gotten myself composed and I'm ready to address questions on the system now.
I've had the time to develop a spreadsheet and modify (simplify!) the Elo method down to a basic score-tracking system which might prove useful to the GBA. It’s best to explain in practical terms, but there needs to be some mathematical intro, too. So I’m going to present the formulas, and then walk through a hypothetical series of matches between 6 GBA participants.
The core of the Elo ranking system is to predict what someone SHOULD score, using the relative quality (rating) of the two opponents, and then update that rating based on how they actually performed, match after match. So if you have two equally-skilled players dueling, you would predict a 50-50 chance of either player winning the match. On the other hand, if someone is (let's say exactly) twice as good as another player, then the odds skew to 67-33.
You can write this down as a formula for the expected win percentage (P#) of each player, based on a consistent numerical rank (R#) of the two players:
P1 = R1/(R1 + R2)
P2 = R2/(R1 + R2)
And then obviously when you add up the win chances P1 + P2 it comes out to 1.
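To make that concrete, here's a minimal sketch in Python (the function name and plain-float ratings are just for illustration; my actual version lives in a spreadsheet):

```python
def expected_scores(r1: float, r2: float) -> tuple[float, float]:
    """Expected win percentages for two players with ratings r1 and r2."""
    total = r1 + r2
    return r1 / total, r2 / total

# Equal ratings -> 50-50.
p1, p2 = expected_scores(50.0, 50.0)
assert p1 == 0.5 and p2 == 0.5

# One player rated twice as high -> roughly 67-33, and P1 + P2 is always 1.
p1, p2 = expected_scores(100.0, 50.0)
assert abs(p1 - 2 / 3) < 1e-9
assert abs(p1 + p2 - 1.0) < 1e-12
```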
Now you ask the question: "what was the actual score (S) versus the predicted score (P)?" And here's where we get into actually coming up with the basis for the R# values. The beauty of this part as it applies to GBA rankings is that it doesn't actually matter what point system you use; for simplicity in the practical example, I used W = 1, L = -1 for the score column, and I disregarded ties (because I don't think they should exist). But because the formulas deal with score differences and a scale factor (K) (which only exists to make the numbers more readily understood), the method can apply to any system of points, and it can accommodate ties.
The actual formula for how the ratings change after a match result is:
dR = K(S - P)
So in the 50-50 case, because that match could have gone either way, the actual score for the winner is 1 (and 0 for the loser), and the winner's ranking will reflect that they “earned” half a win (+0.5*K) over what was expected. That's uninteresting on its own; it gets clearer why this is a good system when you consider the two possibilities of the 67-33 match. If the favored player wins, his predicted score tells him he's already 2/3 of the way to victory, so his ranking changes by what he “earned”: 1/3 of a win (+0.33*K). On the other hand, if the underdog wins, the formula shows that was a much tougher match, so his ranking goes up by a larger amount (+0.67*K). In other words, when you win against people who are further below your perceived “skill level”, you get less improvement to your rank. On the other hand, if you are playing against someone who's “out of your league”, you'll be heftily rewarded for an upset. The upper and lower bounds of the system will tend to level out as ratings approach their limits, which depend on the choice of K.
Most importantly, within a small number of matches, the rankings will stratify and you will get distinctions with a much finer level of granularity than the system being proposed, where 4-2 = 2-0 as far as the points total goes.
Trivially, if no other terms are added to the formula, you can see that dR(loser) = -dR(winner).
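Putting the update rule together with the expected-score formula, a sketch might look like this (assuming, as the walkthrough's numbers imply, S = 1 for a win and S = 0 for a loss in the rating update; the W = 1 / L = -1 convention applies to the separate score column):

```python
K = 25.0  # scale factor; only affects how readable the numbers are

def rating_change(rating: float, opp_rating: float, won: bool, k: float = K) -> float:
    """dR = K(S - P), with S = 1 for a win and S = 0 for a loss."""
    p = rating / (rating + opp_rating)
    s = 1.0 if won else 0.0
    return k * (s - p)

# 50-50 match: the winner "earned" half a win.
assert rating_change(50, 50, won=True) == 12.5

# 67-33 match: the favored winner gains K/3, the upset winner gains 2K/3.
assert abs(rating_change(100, 50, won=True) - K / 3) < 1e-9
assert abs(rating_change(50, 100, won=True) - 2 * K / 3) < 1e-9

# Zero-sum: the loser's change is the negative of the winner's.
d_win = rating_change(100, 50, won=True)
d_loss = rating_change(50, 100, won=False)
assert abs(d_win + d_loss) < 1e-12
```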
So, enough theory. Let's consider the example of Players A, B, C, D, E, and F. Each one picks an opponent, and I'm going to go until 3 rounds of 3 matches have been played (in practice, you would update the rankings after each match, not waiting on anyone else to complete their games, but this is a fairy tale example. Deal with it.). I wanted the ratings to stay approximately between 0 and 100, so a good number to use for K is 25. Everyone starts at 0-0 and is given R# = 50. The results per “round” are shown below. At the beginning, it looks like this:
Round 0
Player | Wins | Losses | Score | Rank |
A | 0 | 0 | 0 | 50.0 |
B | 0 | 0 | 0 | 50.0 |
C | 0 | 0 | 0 | 50.0 |
D | 0 | 0 | 0 | 50.0 |
E | 0 | 0 | 0 | 50.0 |
F | 0 | 0 | 0 | 50.0 |
The matches played are:
A vs B
C vs D
E vs F
Since everyone was evenly matched, we get the uninteresting result. Bear with it, it gets better from here. Winners get +12.5 and losers get -12.5, and the bracket then looks like this:
Round 1
Player | Wins | Losses | Score | Rank |
A | 1 | 0 | 1 | 62.5 |
C | 1 | 0 | 1 | 62.5 |
E | 1 | 0 | 1 | 62.5 |
B | 0 | 1 | -1 | 37.5 |
D | 0 | 1 | -1 | 37.5 |
F | 0 | 1 | -1 | 37.5 |
The next set of matches played are:
A vs C
B vs D
E vs F
Now we've got some different numbers coming out:
Round 2
Player | Wins | Losses | Score | Rank |
A | 2 | 0 | 2 | 75.0 |
F | 1 | 1 | 0 | 53.1 |
B | 1 | 1 | 0 | 50.0 |
C | 1 | 1 | 0 | 50.0 |
E | 1 | 1 | 0 | 46.9 |
D | 0 | 2 | -2 | 25.0 |
This is where it's starting to get interesting: there are variations among players with equivalent scores, based on the perceived difficulty of their wins. Because E beat F in round 1, F goes into the “grudge match” perceived as facing a harder opponent, and comes out more rewarded for it. It's a curious little outcome, and probably not a very meaningful difference here, but it is a distinction worth pointing out. Meanwhile, the winner of a match where both players have the same rank (B & D being 0-1, A & C being 1-0) gets the same points change as in the Round 0 case, so that's how B and C got back to 50.
Final round:
A vs F
B vs C
E vs D
Gives us:
Round 3
Player | Wins | Losses | Score | Rank |
A | 3 | 0 | 3 | 85.4 |
C | 2 | 1 | 1 | 62.5 |
E | 2 | 1 | 1 | 55.6 |
F | 1 | 2 | -1 | 42.8 |
B | 1 | 2 | -1 | 37.5 |
D | 0 | 3 | -3 | 16.3 |
And this is a fully-distinguished bracket, if you go by rank. A is obviously the “man to beat”; poor D is the “whipping boy”. Scores told you that much. But scores don't tell you who had easier matches on the way to 2-1 or 1-2; the ranks do. C keeps playing opponents at the “same” level as him, so far he has just wavered between 50.0 and 62.5. E lost the grudge match, so he decided to take it easy this time and beat up on D, hence his rank doesn't rise as much, because the ranks can tell what everyone knows: D is an “easy target”. F lost to A (who can blame him? That guy's unstoppable), so he doesn't fall as low as B, who lost to someone about equal to his skill level.
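For anyone who wants to check the arithmetic, here's a sketch replaying the whole walkthrough; the player letters and round pairings are exactly the ones above, and the update assumes S = 1 for the winner and S = 0 for the loser.

```python
# Replay the walkthrough: 6 players, K = 25, everyone starts at 50.
K = 25.0
ratings = {p: 50.0 for p in "ABCDEF"}

def play(winner: str, loser: str) -> None:
    """Apply dR = K(S - P) to both players after a match."""
    p_win = ratings[winner] / (ratings[winner] + ratings[loser])
    delta = K * (1.0 - p_win)
    ratings[winner] += delta
    ratings[loser] -= delta

rounds = [
    [("A", "B"), ("C", "D"), ("E", "F")],  # round 1
    [("A", "C"), ("B", "D"), ("F", "E")],  # round 2 (F upsets E)
    [("A", "F"), ("C", "B"), ("E", "D")],  # round 3
]
for matches in rounds:
    for winner, loser in matches:
        play(winner, loser)

for player, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{player}: {r:.1f}")
# A: 85.4, C: 62.5, E: 55.6, F: 42.8, B: 37.5, D: 16.3
```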
What this example shows you is that you get more information when you ask the question “who did they beat?” instead of “did they win or lose?” The pattern you might be seeing as well is that the player values will tend to oscillate – quite wildly at first, before it becomes apparent from the outcomes who's a good player – but then, as more information emerges, each rating will tend to home in on a value which ought to be a pretty good indication of someone's “true skill” at a given point in time, without making it hard to improve your standing.
It also shows that if someone near the bottom of the ladder beats someone near the top, the top player will fall further in rank than in score. That's appropriate, and it makes the whole ladder more meaningful. People at the bottom can also leap-frog over other folks without beating them directly, so people who are genuinely getting better will show that improvement more quickly. It makes the competition fiercer.
The fact that it is formula-based and score-independent will open up the possibility to add in some pretty useful dynamic effects, too. You can easily imagine adding in something to the effect of, "if you were the player who issued the challenge, you were probably slightly favored to win" and that will help distinguish the rankings even further.
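One hypothetical way to fold in that challenger effect: nudge the expected score before applying dR = K(S - P). The 0.05 bonus here is invented purely for illustration, not a tuned value.

```python
def expected_with_challenge(r1: float, r2: float, p1_is_challenger: bool,
                            bonus: float = 0.05) -> float:
    """Expected score for player 1, shifted by a small challenger bonus.

    The bonus reflects the assumption that the player who issued the
    challenge was slightly favored, so they 'earn' a bit less for winning.
    """
    p1 = r1 / (r1 + r2)
    p1 = p1 + bonus if p1_is_challenger else p1 - bonus
    return min(max(p1, 0.0), 1.0)  # clamp to a valid probability

# Evenly matched, but player 1 issued the challenge: expected 0.55, not 0.50.
assert abs(expected_with_challenge(50, 50, True) - 0.55) < 1e-12
```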
And consider: if you went lifetime 22-5, you earned 17 points. Nobody takes that away from you. But if you decide to retire from the site, I don't think you deserve a spot that high up the rankings list anymore, right? Hall of fame, OK, sure. But the rankings system can handle that automatically, without needing to remove names from the list: you just add in a “time decay”. A rule to the effect of “if a player goes inactive, they gradually lose rating such that it would return to 50 over the course of 365 days with no duels completed” is very simple to incorporate. This keeps the top of the ladder from getting saturated by inactive people who did well and then disappeared, and it maintains the incentive for people at the top of the ranks to keep playing if they want to stay there.
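A sketch of what that decay rule could look like, assuming a simple linear pull toward the 50-point baseline (the 365-day horizon is the one suggested above; the exact curve is a design choice):

```python
def decayed_rating(rating: float, idle_days: int, horizon: int = 365,
                   baseline: float = 50.0) -> float:
    """Pull an inactive player's rating back toward the baseline linearly."""
    frac = min(idle_days, horizon) / horizon
    return rating + (baseline - rating) * frac

assert decayed_rating(85.4, 0) == 85.4      # active player: untouched
assert decayed_rating(85.4, 365) == 50.0    # a full idle year: back to 50
assert decayed_rating(20.0, 365) == 50.0    # decay works upward, too
```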
Another improvement to think about is making the K-value depend on the number of matches played. As we saw, ratings swing wildly in the beginning, then settle down as the rankings start to portray people accurately. So maybe you say, “for the first 5 games a player duels in, K = 10; then we up the ante to K = 25 for full-fledged players.” This also helps discourage two negatives of the rank system: veteran players beating up on newbies (who will, statistically, enter the GBA with a better rating than half of all prospective duelists), and newbies instantly scalping the top of the ladder (i.e., a very good duelist who hasn't actually dueled in the GBA yet comes in, targets #1, and climbs the ladder really quickly). Again, it's trivial to implement a formula that accomplishes this.
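The match-count-dependent K really is a one-liner; the 5-game cutoff and the 10/25 values are just the hypothetical numbers floated above.

```python
def k_for(matches_played: int) -> float:
    """Provisional K for new players, full K once they're established."""
    return 10.0 if matches_played < 5 else 25.0

assert k_for(0) == 10.0   # brand-new player
assert k_for(5) == 25.0   # full-fledged player from game 6 onward
```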
In a world where Google Sheets exists, it ought to be pretty easy to get the GBA staff access to a single spreadsheet containing all the formulas so that all you’ve got to do is put in a match result, put in the rankings, and then sort/copy/paste the output here.
I've tinkered with stuff enough that I think I'm able to answer any further questions on the topic.